Twitter US Airline Sentiment Analysis

Description

Background and Context:

Twitter possesses 330 million monthly active users, which allows businesses to reach a broad population and connect with customers without intermediaries. On the other hand, there’s so much information that it’s difficult for brands to quickly detect negative social mentions that could harm their business.

That's why sentiment analysis/classification, which involves monitoring emotions in conversations on social media platforms, has become a key strategy in social media marketing.

Listening to how customers feel about the product/service on Twitter allows companies to understand their audience, keep on top of what’s being said about their brand and their competitors, and discover new trends in the industry.

Data Description

A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

Data

Data Dictionary

The dataset has the following columns:

Objective

Implement a natural language processing model that pre-processes the data, builds a bag of words and vocabulary, and predicts customer sentiment for the airline service.

Solution Approach - Sentiment_Analysis with NLP

Understand Given Data

Load the given data, Tweets.csv, into a data frame and understand the data: data types, nature of the data, features included, total records, and whether the data has missing values, duplicates, or outliers.

Visualize the data to understand its range and detect outliers.

Loading necessary libraries for EDA

Load all standard Python library packages.

Data Manipulation

Data Visualization

Install style & auto time libraries

Install contractions for NLP text pre-processing

Importing required libraries

Data Summary

Download Data & setup data

Download data from Google Drive & save data to a DataFrame

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.

Observations on data

Exploratory Data Analysis (EDA)

Exploratory data analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

Checking for missing values

Let's check which columns have null values and count the nulls in each.
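A minimal sketch of the null check, assuming the data sits in a pandas DataFrame. The three-row frame below is a made-up stand-in for Tweets.csv; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Tweets.csv DataFrame.
df = pd.DataFrame(
    {
        "airline_sentiment": ["negative", "positive", "neutral"],
        "negativereason": ["Late Flight", None, None],
        "text": ["@airline late again", "@airline thanks!", "@airline what gate?"],
    }
)

# Count nulls per column and keep only the columns that have any.
null_counts = df.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

In this toy frame only negativereason is reported, because the positive and neutral rows carry no negative reason.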

Observation on missing data

Most missing values are in the negative-reason fields, which do not exist for positive and neutral sentiments.

None of the fields with missing values will impact our sentiment analysis, so the missing values do not need to be filled.

Checking for Duplicate Values

Let's check for any duplicate values.

The given data has 36 duplicate records. These duplicate records will not impact the sentiment analysis, so we are not required to remove them.
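The duplicate check can be sketched as follows, assuming a pandas DataFrame. The rows are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical rows; the second row repeats the first exactly.
df = pd.DataFrame(
    {
        "airline": ["United", "United", "Delta"],
        "text": ["flight delayed", "flight delayed", "great crew"],
    }
)

# duplicated() flags every repeat of an earlier row (the first copy is not flagged).
duplicate_count = int(df.duplicated().sum())
print(f"Duplicate rows: {duplicate_count}")

# If removal were wanted, drop_duplicates() would keep the first copy of each row.
deduplicated = df.drop_duplicates()
```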

Top 10 value(s) & value count(s)

Observation on features top 10 values

Remove features not required for Initial Analysis

Removing Columns

Check the data types of the columns in the dataset.

checking data types and data summary of all columns

Check data summary

All numerical features & summary

Observation data

After removing the features not required for initial analysis, we have 7 features.

Observation on missing data

Only 2 features have missing values, because they are related to negative sentiment. Positive and neutral tweets will not have any values for them.

Dataset with only required features

View the first and last 5 rows of the dataset.

Univariate analysis & Bivariate analysis

Visualize all features before any data clean up and understand what data needs cleaning and fixing.

Analysis on features

Univariate analysis helps to check data skewness and possible outliers and spread of the data. Bivariate analysis helps to check data relation between two features.

Creating common methods that can plot univariate charts with histplot, boxplot, and bar chart (%)

Analysis on airline_sentiment

Observations on airline_sentiment

Analysis on airline_sentiment_confidence

Observations on airline_sentiment_confidence

Analysis on negativereason

Observations on Negative reason

Observations on Negative reason by Airlines

Analysis on negativereason_confidence

Observations on negativereason_confidence

Analysis on airline

Observations on airline

Analysis on retweet_count

Observations on retweet_count

Analysis on text

Generate Tweet Length and words count
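A minimal sketch of deriving the two text features with pandas string methods. The column names tweet_length and word_count are assumptions, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample rows using the dataset's column names.
df = pd.DataFrame(
    {
        "airline_sentiment": ["negative", "positive"],
        "text": ["my flight was cancelled again", "thanks a lot"],
    }
)

# Character length and whitespace-separated word count per tweet.
# tweet_length and word_count are assumed column names.
df["tweet_length"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()

# Average word count per sentiment class.
print(df.groupby("airline_sentiment")["word_count"].mean())
```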

Observations on text

Critical feature for sentiment analysis

  1. Negative tweets have more words and a greater tweet length than positive and neutral tweets
  2. On average, negative tweets have 20 or more words
  3. Positive and neutral tweets average 10 to 15 words
  4. Only negative tweets have outliers in word count; some negative tweets have only 2 words

Word cloud graph of tweets for positive, negative & neutral sentiment

Create a custom stopwords list by removing some words related to negative sentiment (such as "not") from the stop word list. Keeping these words in the tweets will help identify tweet sentiment later.

In total we have 146 stopwords.
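The idea behind the custom list can be sketched in plain Python. The base list and the negation set below are short hypothetical examples, not the 146 words used in the notebook.

```python
# Hypothetical base list; a real run would start from a full stop word list
# such as the one shipped with NLTK.
base_stopwords = {"the", "a", "is", "to", "not", "no", "nor", "and", "was"}

# Negation words are kept in the text because they can flip the sentiment.
negation_words = {"not", "no", "nor"}

custom_stopwords = base_stopwords - negation_words


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stopwords]


print(remove_stopwords(["the", "flight", "was", "not", "good"], custom_stopwords))
```

With the negations removed from the stop word set, "not good" survives cleaning instead of collapsing to "good".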

Plot Wordcloud Features

Common method to plot a word cloud, since we reuse it multiple times later

Wordcloud - From All tweets

Observation - Wordcloud - From All tweets

Total tweets - 14640

Top words we see in word cloud are airline names, flight, not, get, cancelled, service, thanks.

Wordcloud - Negative Tweets

Observation - Wordcloud - From negative tweets

Total negative tweets - 9178

Top words we see in word cloud are airline names, flight, not, get, cancelled, service, hold.

Wordcloud - Positive Tweets

Observation - Wordcloud - From positive tweets

Total positive tweets - 2363

Top words we see in word cloud are airline names, flight, thanks, thank, great, love

Wordcloud - Neutral Tweets

Observation - Wordcloud - From neutral tweets

Total neutral tweets - 3099

Top words we see in word cloud are airline names, flight, get, need, please not, help

Data correlation analysis - Heatmap

Let's check how the numerical features are related.

Key Observations on Data correlation

Pair Plot analysis

Key Observations on Pair Plot

Insights based on EDA

Observation on missing data

Most missing values are in the negative-reason fields, which do not exist for positive and neutral sentiments.

None of the fields with missing values will impact our sentiment analysis, so the missing values do not need to be filled.

Removed features before EDA

After removing the features not required for initial analysis, we have 7 features.

Observations on airline_sentiment

Observations on airline_sentiment_confidence

Observations on Negative reason

Observations on Negative reason by Airlines

Observations on negativereason_confidence

Observations on retweet_count

Observations on text

Critical feature for sentiment analysis

  1. Negative tweets have more words and a greater tweet length than positive and neutral tweets
  2. On average, negative tweets have 20 or more words
  3. Positive and neutral tweets average 10 to 15 words
  4. Only negative tweets have outliers in word count; some negative tweets have only 2 words

Observation - Wordcloud - From All tweets

Total tweets - 14640

Top words we see in word cloud are airline names, flight, not, get, cancelled, service, thanks.

Observation - Wordcloud - From negative tweets

Total negative tweets - 9178

Top words we see in word cloud are airline names, flight, not, get, cancelled, service, hold.

Observation - Wordcloud - From positive tweets

Total positive tweets - 2363

Top words we see in word cloud are airline names, flight, thanks, thank, great, love

Observation - Wordcloud - From neutral tweets

Total neutral tweets - 3099

Top words we see in word cloud are airline names, flight, get, need, please not, help

Key Observations on Pair Plot

Data Cleaning - Text Pre-processing

Clean tweet text before processing & preparing train & test data for NLP models

Text Pre-processing:

Remove html tags from data

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Replace all contractions from data

contractions is a Python library for expanding and creating common English contractions in text. This is very useful for dimensionality reduction by normalizing the text before generating word or character vectors. It expands contractions using simple replacement rules for commonly used English contractions. Some contractions have more than one possible expansion, for example:

I'd -> I would

I'd -> I had
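The replacement-rule idea can be sketched with the standard library alone. This is a stand-in for the contractions package, not its implementation: CONTRACTION_MAP and expand_contractions are hypothetical names, and the mapping is a tiny illustrative subset.

```python
import re

# Small illustrative mapping; the contractions package ships a far larger one.
# An ambiguous form such as "I'd" needs one fixed expansion under simple rules.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "i'd": "i would",
    "it's": "it is",
}

_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form from the map."""
    return _pattern.sub(lambda match: CONTRACTION_MAP[match.group(0).lower()], text)


print(expand_contractions("I can't believe it's delayed"))
```

Because the rules are plain lookups, "I'd" always expands the same way regardless of context, which is the limitation the two examples above illustrate.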

Remove numbers from data

Remove numbers from the text data

Remove all @ mentions from data

Remove all tweet account names. They are not critical for sentiment analysis, and the word clouds show that account names are among the top words for every sentiment.
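Number and mention removal can each be sketched as a one-line regular expression. The function names and the sample tweet are made up for illustration.

```python
import re


def remove_mentions(text):
    """Drop Twitter handles: an @ followed by letters, digits, or underscores."""
    return re.sub(r"@\w+", "", text)


def remove_numbers(text):
    """Drop every run of digits."""
    return re.sub(r"\d+", "", text)


# Made-up tweet for illustration.
tweet = "@VirginAmerica flight 236 was delayed 2 hours"

# split/join collapses the extra spaces left behind by the removals.
cleaned = " ".join(remove_numbers(remove_mentions(tweet)).split())
print(cleaned)
```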

Remove all non ascii from data

Remove non-ASCII characters from list of tokenized words

Convert data to lowercase

Convert all characters to lowercase from list of tokenized words

Remove punctuation from list of tokenized words

Remove punctuation from list of tokenized words

Remove stop words from list of tokenized words

Remove stop words from list of tokenized words

Lemmatization with NLTK

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meanings to one word.

Data Preparation

Prepare the data before building the model. Take the text feature, apply all the text pre-processing methods one by one, and clean the data.

Drop Columns

Drop columns not required for sentiment analysis

Keeping only text & airline_sentiment

Understand the shape of the dataset.

Before data text pre processing

View the first and last 5 rows of the dataset.

Data Cleaning Pipeline

Chain all methods one by one and clean tweet data
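A minimal sketch of chaining the cleaning steps, using only the standard library. It is a simplified stand-in: HTML stripping uses a regular expression instead of Beautiful Soup, contraction expansion and lemmatization are left out, and STOPWORDS is a short hypothetical list.

```python
import re
import string
import unicodedata

# Hypothetical short stop word list; negations such as "not" are deliberately absent.
STOPWORDS = {"the", "a", "is", "was", "to", "and", "my"}


def strip_html(text):
    # Crude tag removal for the sketch; the notebook uses Beautiful Soup for this step.
    return re.sub(r"<[^>]+>", " ", text)


def to_ascii(text):
    # Decompose accented characters, then drop anything outside ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def clean_tweet(text):
    """Apply the cleaning steps in order and return the cleaned text."""
    text = strip_html(text)
    text = re.sub(r"@\w+", " ", text)  # remove @ mentions
    text = re.sub(r"\d+", " ", text)   # remove numbers
    text = to_ascii(text).lower()      # drop non-ASCII, then lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


print(clean_tweet("@united My flight 236 was NOT on time!!! <br> café closed"))
```

With pandas, a function like this could be applied row by row, for example through `df["text"].apply(clean_tweet)`.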

Clean Tweet text data

Create a new feature, cleaned_tweet, from the existing tweet text feature by applying text pre-processing.

Create a new feature, airline_sentiment_label, from the existing airline_sentiment feature, using numerical encoding for the sentiment labels.
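The label encoding can be sketched with a plain dictionary lookup. The integer assigned to each class is an assumption; the notebook's actual mapping is not shown in the text.

```python
import pandas as pd

# Hypothetical rows; the three class names come from the dataset description.
df = pd.DataFrame({"airline_sentiment": ["negative", "neutral", "positive", "negative"]})

# The integer assigned to each class is an assumption of this sketch.
label_map = {"negative": 0, "neutral": 1, "positive": 2}
df["airline_sentiment_label"] = df["airline_sentiment"].map(label_map)

print(df)
```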

After data text pre processing

Check Data before and after cleaning

Observation on data cleaning

We can see that text pre-processing was applied and cleaned_tweet has only the core text required for sentiment analysis.

Data Preparation for Modeling

Sentiment Keys after one hot encoding

Splitting data into training and test

Split data for training & validation. We use validation data to evaluate model performance
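A minimal sketch of the split with scikit-learn. The texts, the 25% hold-out size, and the seed are assumed values for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical cleaned tweets and numeric labels (0 = negative, 1 = positive here).
texts = ["late flight", "rude crew", "lost bag", "bad service",
         "great crew", "thanks much", "love airline", "smooth flight"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# stratify keeps the class ratio the same in both splits;
# the 25% test size and the seed are assumed values.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
```

Stratifying matters for this dataset because negative tweets far outnumber the other two classes.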

Bag of Words (CountVectorizer)

CountVectorizer is used to convert a collection of text documents to a vector of term/token counts. It also enables the pre-processing of text data prior to generating the vector representation. This functionality makes it a highly flexible feature representation module for text.

Prepare training and test data

Term Frequency(TF) - Inverse Document Frequency(IDF)

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

Prepare training and test data

Building Natural Language Processing Models

Let's build different NLP models using Count Vectorizer data and Tfidf Vectorizer data.

Compare the performance of all models & fine-tune for further improvements.

Methods to Plot performance scores & Plot Confusion Matrix

Common method to plot model scores & confusion matrix for all models
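A minimal sketch of the scoring half of such a helper, assuming scikit-learn metrics. The function name score_model and the sample predictions are hypothetical; the plotting itself is left out.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

LABELS = ["negative", "neutral", "positive"]


def score_model(y_true, y_pred):
    """Return accuracy and the confusion matrix (rows = actual, columns = predicted)."""
    accuracy = accuracy_score(y_true, y_pred)
    matrix = confusion_matrix(y_true, y_pred, labels=LABELS)
    return accuracy, matrix


# Hypothetical labels and predictions.
y_true = ["negative", "negative", "neutral", "positive"]
y_pred = ["negative", "neutral", "neutral", "positive"]

accuracy, matrix = score_model(y_true, y_pred)
print(accuracy)
print(matrix)
```

The returned matrix is a plain NumPy array, so it can be handed to a heatmap plot for display.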

Unsupervised Learning methods for Sentiment Analysis

Model 1 - Vader Sentiment Analysis

VADER (Valence Aware Dictionary and Sentiment Reasoner) is a lexicon and rule-based sentiment analysis tool. VADER not only tells us about the Positivity and Negativity score, but also tells us how positive or negative a sentiment is.
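VADER summarizes each text in a compound score between -1 and 1. The sketch below shows how such a score could be turned into one of the three labels. The cut-offs of 0.05 and -0.05 are the commonly suggested ones; whether the notebook used them is an assumption.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to one of three sentiment labels."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# In a full run the score would come from something like
# SentimentIntensityAnalyzer().polarity_scores(text)["compound"];
# the values below are made up for illustration.
for score in (0.64, -0.51, 0.0):
    print(score, label_from_compound(score))
```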

Vader sentiment on Original tweet text

Using original text to predict sentiment

Observation on Vader sentiment with original tweet

Vader sentiment on cleaned text tweet

Using cleaned text to predict sentiment

Observation on Vader sentiment with cleaned tweet

This model did not perform well. Very low accuracy scores.

Observation on Vader Sentiment Analysis

Vader sentiment analysis on both original text and cleaned text did not perform well. Very low accuracy scores.

Supervised Learning methods for Sentiment Analysis

Model 1 - Naive Bayes Multi Class - Sentiment Analysis - with Count Vectorizer features
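A minimal sketch of a multinomial Naive Bayes classifier on bag-of-words counts, assuming scikit-learn. The training tweets are made up and far too few for a real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned tweets and labels.
train_texts = ["great flight thanks", "love the crew",
               "flight cancelled bad service", "late flight rude"]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words counts feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

prediction = model.predict(["thanks great crew"])[0]
print(prediction)
```

Swapping CountVectorizer for TfidfVectorizer in the pipeline gives the TF-IDF variant of the same model.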

Observations on Naive Bayes Multi Class - Sentiment Analysis - with Count Vectorizer features

Naive Bayes Multi Class - Sentiment Analysis - with Tfidf Vectorizer features

Observations on Naive Bayes Multi Class - Sentiment Analysis - with Tfidf Vectorizer features

Model 2 - Using Random Forest to build a model for the classification of tweets - Sentiment Analysis - with Count Vectorizer features
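A minimal sketch of a Random Forest trained on count features, assuming scikit-learn. The tweets are made up, and n_estimators and random_state are assumed values rather than tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned tweets and labels.
train_texts = ["great flight thanks", "love the crew", "smooth landing",
               "flight cancelled bad service", "late flight rude", "lost my bag"]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# n_estimators and random_state are assumed values, not tuned settings.
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_train, train_labels)

predictions = forest.predict(vectorizer.transform(["rude crew late"]))
print(predictions)
```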

Observation - Random Forest - Sentiment Analysis - with Count Vectorizer features

Model 2 - Using Random Forest to build a model for the classification of tweets - Sentiment Analysis - with TF-IDF Vectorizer features

Observation - Random Forest - Sentiment Analysis - with TF-IDF Vectorizer features

Model 3 - NLP With TensorFlow/Keras

Common method to plot model performance, loss and accuracy scores for TensorFlow models

Import all required TensorFlow libraries

Creating custom callbacks to log accuracy scores and adjust the learning rate every 3 epochs
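One way to adjust the learning rate every 3 epochs is a step schedule. The sketch below is plain Python; the halving factor and the starting rate are assumptions, since the notebook's actual values are not given.

```python
def step_decay(epoch, lr, every=3, factor=0.5):
    """Multiply the learning rate by `factor` at every `every`-th epoch (epoch 0 excluded)."""
    if epoch > 0 and epoch % every == 0:
        return lr * factor
    return lr


# Simulate one call per epoch, feeding each result back in as the current rate.
lr = 0.01
history = []
for epoch in range(7):
    lr = step_decay(epoch, lr)
    history.append(lr)

print(history)
```

A function with this `(epoch, lr)` signature can be passed to `tf.keras.callbacks.LearningRateScheduler`, which calls it at the start of each epoch.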

Encode sentiment labels using to_categorical for use with TensorFlow models

Model 3 - NLP With TensorFlow/Keras - with Count Vectorizer features

Build input features with the TensorFlow Keras TextVectorization layer

Count Vectorizer - output_mode:count - "count": Like "multi_hot", but the int array contains a count of the number of times the token at that index appeared in the batch item.
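A small sketch of the count output mode, assuming TensorFlow 2 with the Keras TextVectorization layer. The tweets are made up, and the layer is left at its default settings apart from output_mode.

```python
import tensorflow as tf

# Hypothetical cleaned tweets used to build the vocabulary.
train_texts = tf.constant(["flight late", "late flight late", "great crew"])

# output_mode="count" yields one column per vocabulary entry holding token counts.
# Swapping in output_mode="tf_idf" would weight those slots by TF-IDF instead.
vectorizer = tf.keras.layers.TextVectorization(output_mode="count")
vectorizer.adapt(train_texts)

# "delayed" was never seen during adapt, so it is counted in the [UNK] slot.
counts = vectorizer(tf.constant(["flight late late", "delayed flight"])).numpy()
print(vectorizer.get_vocabulary())
print(counts)
```

Each output row has one slot per vocabulary entry, with the first slot reserved for out-of-vocabulary tokens.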

Create Model

Train Model

Observations - NLP With TensorFlow/Keras - with Count Vectorizer features

Model 3 - NLP With TensorFlow/Keras - with Tf_Idf Vectorizer features

Build input features with the TensorFlow Keras TextVectorization layer

Tf_idf Vectorizer - output_mode:tf_idf - "tf_idf": Like "multi_hot", but the TF-IDF algorithm is applied to find the value in each token slot. For "int" output, any shape of input and output is supported. For all other output modes, currently only rank 1 inputs (and rank 2 outputs after splitting) are supported.

Create Model

Train Model

Observations - NLP With TensorFlow/Keras - with TF-IDF Vectorizer features

Check Performance Scores for all models

Check accuracy scores for all models and pick the best model for further performance tuning

Observations on Model performance

TensorFlow NN NLP model scores are good, but the models built on both feature sets overfit heavily.

Let's optimize the model further.

Keras Model - Performance Tuning & Avoid Overfitting

The goal is to achieve a final accuracy score >= 80% for both training & validation.

Adding Features to avoid overfitting

  1. Weight Regularizers
  2. Dropouts
  3. Adding Multiple hidden layers
  4. Dynamic Learning Rate
  5. Early Stopping to avoid overfitting

Method to create model using dynamic parameters, so we can run multiple experiments
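A hedged sketch of what such a parameterized builder could look like with Keras. The function name build_model, the layer sizes, and every default value are assumptions made for illustration; the notebook's actual architecture is not shown in the text.

```python
import tensorflow as tf


def build_model(input_dim, hidden_units=(64, 32), l2_rate=0.0, dropout_rate=0.0,
                learning_rate=1e-3, num_classes=3):
    """Build a dense classifier; all default values are assumptions of this sketch."""
    regularizer = tf.keras.regularizers.l2(l2_rate) if l2_rate > 0 else None
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))
    for units in hidden_units:
        model.add(tf.keras.layers.Dense(units, activation="relu",
                                        kernel_regularizer=regularizer))
        if dropout_rate > 0:
            model.add(tf.keras.layers.Dropout(dropout_rate))
    model.add(tf.keras.layers.Dense(num_classes, activation="softmax"))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Early stopping halts training once validation loss stops improving.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

# One builder covers both setups: with and without regularization and dropout.
baseline = build_model(input_dim=1000)
regularized = build_model(input_dim=1000, l2_rate=1e-3, dropout_rate=0.5)
regularized.summary()
```

Keeping the regularization strength and dropout rate as arguments lets each experiment change one setting while the rest of the architecture stays fixed.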

Experiment 1

No weight regularization or Dropouts

Observation on Experiment 1

No weight regularization or Dropouts - Overfit

Experiment 2

With weight regularization and Dropouts

We achieved the target accuracy score of >= 80% for both training & validation.

Top 40 Features WordCloud

Count vectorizer - Top 40 Features

Observations on Count vectorizer - Top Features

Top words on Count vectorizer are thank, great, not, cancel, flight

TF-IDF vectorizer - Top 40 Features

Observations on TF-IDF vectorizer - Top Features

Top words on TF-IDF vectorizer are thank, great, not, cancel, flight, love, delay

Summary

Observation - From negative tweets

Total negative tweets - 9178

Top words we see in word cloud are airline names, flight, not, get, cancelled, service, hold.

Observation - From positive tweets

Total positive tweets - 2363

Top words we see in word cloud are airline names, flight, thanks, thank, great, love

Observation - From neutral tweets

Total neutral tweets - 3099

Top words we see in word cloud are airline names, flight, get, need, please not, help

Critical feature for sentiment analysis

  1. Negative tweets have more words and a greater tweet length than positive and neutral tweets
  2. On average, negative tweets have 20 or more words
  3. Positive and neutral tweets average 10 to 15 words

Observations on Negative reason by Airlines